perm filename MULTID[4,KMC] blob sn#155781 filedate 1975-04-18 generic text, type T, neo UTF8
	MULTIDIMENSIONAL EVALUATION OF  A SIMULATION
	       OF PARANOID THOUGHT PROCESSES

               KENNETH MARK COLBY
                     AND
              FRANKLIN DENNIS HILF

	Once  a  simulation  model  reaches  a  stage  of   intuitive
adequacy,  a  model  builder  should  consider  using  more stringent
evaluation procedures relevant to the model's purposes. For  example,
if  the  model  is  to  serve  as  a training device, then a simple
evaluation of its pedagogic effectiveness would be sufficient.    But
when  the  model  is  proposed  as  an explanation of a psychological
process, more is demanded of the evaluation procedure.
	We shall first  give  a  brief  description  of  a  model  of
paranoid  processes. A more complete account can be found in Colby,
Weber, and Hilf [1]. We shall then discuss the evaluation
problem which asks "how good is the model?"  or  "how  close  is  the
correspondence  between the behavior of the model and the  phenomena
it is intended to explain?"
       (LEE-- INSERT DESCRIPTION OF MODEL HERE)
	Turing's test has often been suggested as a validation procedure.
It  is  very easy to become confused about Turing's test.  In
part this is due to Turing himself, who  introduced  the  now-famous
imitation   game   in   a  paper  entitled  COMPUTING  MACHINERY  AND
INTELLIGENCE (Turing, 1950). A careful reading of this paper  reveals
that there are actually two imitation games, the second of  which  is
commonly called Turing's test.
	In the first imitation game  two  groups  of  judges  try  to
determine which of two interviewees is a woman. Communication between
judge and  interviewee  is  by  teletype.  Each  judge  is  initially
informed  that  one  of the interviewees is a woman and one a man who
will pretend to be a woman. After the interview, the judge  is  asked
what  we shall call the woman-question, i.e., which interviewee was the
woman?  Turing does not say what else the  judge  is  told,  but  one
assumes  the  judge is NOT told that a computer is involved nor is he
asked to determine which  interviewee  is  human  and  which  is  the
computer.  Thus,  the  first  group  of  judges  would  interview two
interviewees:    a woman, and a man pretending to be a woman.
	The  second  group  of judges would be given the same initial
instructions, but unbeknownst to them, the two interviewees would  be
a  woman  and a computer programmed to imitate a woman.   Both groups
of judges  play  this  game  until  sufficient  statistical  data are
collected  to  show  how  often the right identification is made. The
crucial question then is:  do the judges decide wrongly AS OFTEN when
the  game  is  played  with man and woman as when it is played with a
computer substituted  for  the  man?  If  so,  then  the  program  is
considered  to  have  succeeded in imitating a woman as well as a man
does.  For emphasis we repeat: in asking the woman-question  in  this
game,  judges  are  not required to identify which interviewee is
human and which is machine.
	Later  on  in  his  paper  Turing proposes a variation of the
first game. In the second game one interviewee is a man and one is  a
computer.   The judge is asked to determine which is man and which is
machine, which we shall call the machine-question. It is this version
of  the game which is commonly thought of as Turing's test.    It has
often been suggested as a means of validating computer simulations of
psychological processes.
	In  the  course  of  testing a simulation (PARRY) of paranoid
linguistic behavior in a psychiatric interview, we conducted a number
of  Turing-like  indistinguishability  tests  (Colby, Hilf, Weber and
Kraemer, 1972). We say `Turing-like' because none of them consisted of
playing  the  two  games  described above. We chose not to play these
games for a number of reasons which can be summarized by saying  that
they  do  not  meet modern criteria for good experimental design.  In
designing our tests we were primarily  interested  in  learning  more
about   developing   the  model.   We  did  not  believe  the  simple
machine-question to be  a  useful  one  in  serving  the  purpose  of
progressively   increasing  the  credibility  of  the  model  but  we
investigated a variation of it to satisfy the curiosity of colleagues
in artificial intelligence.
	In this design eight psychiatrists conducted  interviews  by
teletype,  using  the  technique  of  machine-mediated  interviewing,
which involves  what  we  term  "non-nonverbal"  communication  since
non-verbal   cues   are  made  impossible  (Hilf, 1972).  Each  judge
interviewed two patients, one being PARRY and one being a hospitalized
paranoid  patient.    The  interviewers  were  not  informed  that  a
simulation was involved nor were they asked to identify which was the
machine. Their task was to conduct a diagnostic psychiatric interview
and rate each response from the  `patients'  along  a  0-9  scale  of
paranoidness,  0  meaning  none  and  9 being highest. Transcripts of
these interviews, without the ratings of the interviewers, were  then
utilized  for  various  experiments in which randomly selected expert
judges conducted evaluations  of  the  interview  transcripts.    For
example,  in one experiment it was found that patients and model were
indistinguishable along the dimension of paranoidness.
	To ask the machine-question, we sent  interview  transcripts,
one  with a patient and one with PARRY, to 100 psychiatrists randomly
selected from the Directory of American Specialists and the Directory
of  the  American Psychiatric Association. Of the 41 replies, 21 (51%)
made the correct identification while 20 (49%) were wrong.  Based  on
this  random  sample of 41 psychiatrists, the 95% confidence interval
for the rate of correct identification is between 35.9%  and  66.5%,
a range which includes the 50% expected by chance. (Our
statistical   consultant   was  Dr.   Helena  C.   Kraemer,  Research
Associate  in  Biostatistics,  Department  of  Psychiatry,   Stanford
University.)
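	As an illustration, the interval quoted above can be reproduced
with the normal approximation to the binomial. The text does not say
which method the statistical consultant used; the choice of the
normal approximation, the function name wald_interval and the z value
of 1.96 are assumptions of this sketch.

```python
import math

def wald_interval(correct, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion, in percent.

    Assumption: the source does not name its method; this one happens
    to reproduce the reported figures.
    """
    p = correct / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return 100 * (p - half_width), 100 * (p + half_width)

# 21 correct identifications among the 41 psychiatrists who replied
low, high = wald_interval(21, 41)
print(f"{low:.1f} to {high:.1f}")  # prints 35.9 to 66.5
```

Because the interval contains 50, the observed rate of correct
identification cannot be told apart from guessing at this sample size.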
	Psychiatrists   are   considered  expert  judges  of  patient
interview behavior but they are unfamiliar with computers.  Hence  we
conducted  the  same  test  with  100  computer  scientists  randomly
selected from the membership list of the  Association  for  Computing
Machinery,  ACM.  Of the 67 replies, 32 (48%) were right and 35 (52%)
were wrong. Based on this random sample of 67 computer scientists the
95% confidence interval ranges from 36% to 60%, again  close  to  a
chance level.
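	One way to make "close to a chance level" concrete is an exact
two-sided binomial test against a guessing rate of 50%. The sketch
below is offered under that assumption; neither the test nor the name
binomial_two_sided_p comes from the original study.

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k under success probability p.
    """
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= threshold))

# 32 correct identifications among the 67 computer scientists who replied
p_value = binomial_two_sided_p(32, 67)
print(f"p = {p_value:.3f}")
```

A large p-value here means only that the data are compatible with
guessing; it does not show that model and patients are equivalent.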
	Thus the answer to this machine-question "can expert  judges,
psychiatrists  and  computer scientists, using teletyped transcripts
of psychiatric interviews, distinguish between paranoid patients  and
a  simulation  of paranoid processes?" is "No".  But what do we learn
from this?   It is some comfort that the answer was not "yes" and the
null  hypothesis  (no  differences) failed to be rejected, especially
since statistical tests are somewhat biased in favor of rejecting the
null  hypothesis  (Meehl,1967). Yet this answer does not tell us what
we  would  most  like  to  know,  i.e.  how  to  improve  the  model.
Simulation  models  do  not  spring  forth in a complete, perfect and
final form; they must be gradually developed  over  time.  Perhaps we
might  obtain  a "yes" answer to the machine-question if we allowed a
large number of expert judges to conduct  the  interviews  themselves
rather  than  studying transcripts of other interviewers.  Such an
answer would indicate that the model must be  improved,  but  unless
we systematically
investigated how the judges succeeded in making the discrimination we
would not know what aspects of the model to work on. The logistics of
such a design are immense and obtaining a large N of judges for sound
statistical inference would require an effort disproportionate to the
information-yield.
	A more efficient and informative way to use Turing-like tests
is to ask judges to make ordinal ratings along scaled dimensions from
teletyped  interviews.     We  shall  term  this  approach asking the
dimension-question.   One can then compare scaled ratings received by
the patients and by the model to precisely determine where and by how
much they differ.        Model builders  strive  for  a  model  which
shows     indistinguishability     along    some    dimensions    and
distinguishability along others.  That is, the model converges on what
it is supposed to simulate and diverges from that which it is not.
	We  mailed  paired-interview  transcripts  to   another   400
randomly  selected psychiatrists asking them to rate the responses of
the two `patients' along certain dimensions. The judges were  divided
into  groups,  each  judge  being asked to rate responses of each I-O
(input-output) pair in the interviews along  four  dimensions.   The
total  number  of  dimensions  in  this  test was twelve: linguistic
noncomprehension,  thought  disorder,   organic   brain   syndrome,
bizarreness, anger, fear, ideas of reference,  delusions,  mistrust,
depression, suspiciousness and mania.  These  are  dimensions  which
psychiatrists commonly use in evaluating patients.
	Table 1 shows there were significant differences, with  PARRY
receiving   higher   scores   along   the  dimensions  of  linguistic
noncomprehension, thought disorder, bizarreness, anger, mistrust  and
suspiciousness. On the dimension of delusions the patients were rated
significantly higher. There were no significant differences along the
dimensions  of  organic  brain  syndrome, fear,  ideas  of reference,
depression and mania.
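	The paper later reports t values for differences between mean
ratings, without saying which form of the t test was used. As a
sketch only, a per-dimension comparison could be computed with
Welch's t statistic; the choice of Welch's form, the name welch_t,
the 0-9 scale and all ratings below are hypothetical and are not data
from the study.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical ratings, invented for illustration only
ratings = {
    "linguistic noncomprehension": {
        "model": [5, 6, 4, 7, 5, 6],
        "patient": [2, 3, 2, 4, 3, 2],
    },
    "fear": {
        "model": [3, 4, 2, 3, 4, 3],
        "patient": [3, 3, 4, 2, 4, 3],
    },
}

for dimension, groups in ratings.items():
    t = welch_t(groups["model"], groups["patient"])
    print(f"{dimension}: t = {t:+.2f}")
```

A positive t means the model was rated higher than the patients on
that dimension, and the magnitude indicates how far apart they are.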
	While    tests    asking    the   machine-question   indicate
indistinguishability at  the  gross  level,  a  study  of  the  finer
structure  of  the  model's  behavior  through  ratings  along scaled
dimensions  shows  statistically  significant   differences   between
patients  and  model.     These  differences are of help to the model
builder in suggesting which aspects of the model must be modified and
improved  in  order for it to be considered an adequate simulation of
the class of paranoid patients it is intended to simulate. For example,
it  is  clear  that  PARRY's language-comprehension must be improved.
Once this has been implemented, a future test will  tell  us  whether
improvement has occurred and by how much in comparison to the earlier
version.   Successive identification of particular areas  of  failure
in the model permits their improvement and the development of
more adequate model-versions.
	Further evidence that the machine-question is too coarse  and
insensitive  a test comes from the following experiment. In this test
we constructed a random version of the paranoid model which  utilized
PARRY's  output statements but expressed them randomly no matter what
the interviewer said.   Two psychiatrists conducted  interviews  with
this  model, transcripts of which were paired with patient interviews
and sent to 200  randomly  selected  psychiatrists  asking  both  the
machine-question  and  the dimension-question.  Of the 69 replies, 34
(49%) were right and 35 (51%) wrong. Based on this random  sample  of
69  psychiatrists,  the 95% confidence interval ranges from 39 to 63,
again indicating  a  chance  level.  However  as  shown  in  Table  2
significant  differences  appear  along  the dimensions of linguistic
noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
rated  higher.  On  these  particular  dimensions  we can construct a
continuum in which the random version  represents  one  extreme,  the
actual patients another. Our (nonrandom) PARRY lies somewhere between
these two extremes, indicating that it performs significantly  better
than  the  random version but still requires improvement before being
indistinguishable from  patients  (see  Fig. 1).  Table 3  presents  t
values   for   differences   between   mean   ratings  of  PARRY  and
RANDOM-PARRY (see Table 2 and Fig. 1 for the mean ratings).
The fact that even a random model can pass the machine-question test
shows not that the model is a good simulation, but that the test
is weak and nonchallenging.
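	The random version described above can be pictured with a very
small sketch: a responder that draws from a fixed pool of output
statements and ignores the interviewer's input entirely. The pool
contents and the names OUTPUT_STATEMENTS and random_reply are
placeholders assumed here; PARRY's actual statements and code are not
reproduced.

```python
import random

# Placeholder pool standing in for the model's output statements
# (hypothetical; the real statements are not given in the text)
OUTPUT_STATEMENTS = [
    "<output statement 1>",
    "<output statement 2>",
    "<output statement 3>",
    "<output statement 4>",
]

def random_reply(interviewer_input, rng=random):
    """Return a randomly chosen statement, ignoring the input."""
    return rng.choice(OUTPUT_STATEMENTS)

# The reply is unrelated to whatever the interviewer types
for question in ["HOW ARE YOU?", "WHY ARE YOU IN THE HOSPITAL?"]:
    print(question, "->", random_reply(question))
```

Such a responder serves as a baseline: any test it passes cannot by
itself be taken as evidence that a model understands its input.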
	Thus it can be seen that such a multidimensional evaluation
provides  yardsticks  for measuring the adequacy of this or any other
dialogue simulation model along the relevant dimensions.
	We conclude that when model builders want  to  conduct  tests
which  indicate  in  which  direction  progress  lies and to obtain a
measure of whether  progress  is  being  achieved,  the  way  to  use
Turing-like  tests  is  to  ask  expert  judges to make ratings along
multiple dimensions that are essential to the model.  Useful tests do
not  prove  a  model; they probe it for its strengths and weaknesses.
Simply asking the machine-question yields little information relevant
to what the model builder most wants  to  know,  namely,  along  what
dimensions must the model be improved.


		REFERENCES

[1]  Colby, K.M., Weber, S. and Hilf, F.D., 1971. Artificial paranoia.
	ARTIFICIAL INTELLIGENCE, 2, 1-25.

[2]  Colby, K.M., Hilf, F.D., Weber, S. and Kraemer, H.C., 1972. Turing-like
	indistinguishability tests for the validation of a computer
	simulation of paranoid processes. ARTIFICIAL INTELLIGENCE, 3,
	199-221.

[3]  Hilf, F.D., 1972. Non-nonverbal communication and psychiatric
	research. ARCHIVES OF GENERAL PSYCHIATRY, 27, 631-635.

[4]  Meehl, P.E., 1967. Theory testing in psychology and physics: a
	methodological paradox. PHILOSOPHY OF SCIENCE, 34, 103-115.

[5]  Turing, A., 1950. Computing machinery and intelligence. Reprinted in:
	COMPUTERS AND THOUGHT (Feigenbaum, E.A. and Feldman, J., eds.).
	McGraw-Hill, New York, 1963, pp. 11-35.


		ACKNOWLEDGEMENTS

This research is supported by Grant PHS MH 06645-12 from the National
Institute of Mental Health and in part by Research Scientist Award
(No. 1-K05-K-14,433) from the National Institute of Mental Health to
the senior author.